Lecture 7: A Massive and Messy Marketing Survey

Marketing Overview

Marketing Versus Advertising

Customer Research

Surveys

Difficulties with Surveys

Are Surveys Representative of the Population?

Case Study: A Massive, Messy Marketing Survey

Setting: A real world marketing analytics project with a big company.

The Challenges:

The Company

The Products

The Surveys

Red Flags

All of these factors were potential concerns. While we could make some small recommendations about the design of the survey, the company was committed to the overall approach. They were more interested in understanding the results for the data they had than in revamping the survey.

Similar Surveys, But Not Quite Unified

We noticed a variety of differences among the surveys for the 3 product categories:

Example of Small Differences

In which kind of store are you most likely to shop for the product?

  1. A grocery store.
  2. A candy store.
  3. A pharmacy.
  4. A convenience store.
  5. A bakery.

However, the Salty Snacks products are not sold in candy stores or bakeries. The same question would have different multiple choice options in each survey.

To Unify or Not to Unify?

Ultimately, you have to decide whether the advantages of a unified survey outweigh the advantages of collecting more specific information about certain products.

Survey Questions: Data Conversion

During the past 6 months, how often did you consume this brand of snack food?

These values are text entries. We’d like to convert them into a numeric form. What is the best way to proceed?

Annualizing the Rates of Consumption

We’ll convert all of the values to their corresponding yearly rates of consumption.

Writing a Conversion Function

annualize.consumption <- function(x) {
    library(plyr)
    written.rates <- c("Daily", sprintf("%d times per week", 
        6:2), "1 time per week", sprintf("%d times per month", 
        3:2), "1 time per month", "2 times in the past 3 months", 
        "1 time in the past 3 months", "1 time in the past 6 months")
    annualized.rates <- c(365.25 * (7:1)/7, 12 * (3:1), 
        4 * (2:1), 2 * (1))
    y <- mapvalues(x = x, from = written.rates, to = annualized.rates)
    return(y)
}

Questions Requiring Text Processing

Example: Measuring Advocacy

Have you recommended this brand in the past month?

+ 1:  I recommend this brand all of the time.
+ 2:  I recommend it when people ask about it.
+ 3:  I am neutral.
+ 4:  I recommend against trying this brand when I'm asked.
+ 5:  I recommend against trying this brand all of the time.

If we want to classify the respondents as advocates, we could use the following conversion:

recommendation.responses <- c("I recommend this brand all of the time", "I recommend it when people ask about it.")
dat[, advocates.this.brand := 1*(recommend_response %in% recommendation.responses)]

This converts the text into numeric values, where 1 represents an advocate and 0 is for everyone else.

An Annotated Survey

Learning from the Annotated Survey

A Look at the Data

dat <- fread(input = "Simulated Marketing Data -- Original Form.csv")
dim(dat)
[1] 100000    374

Some Column Names

datatable(data = data.table(names = names(dat)))

Patterns in Column Names

the.pieces <- strsplit(x = names(dat), split = "_")
first.pieces <- lapply(X = the.pieces, FUN = function(x) {
    return(x[1])
})
the.patterns <- unique(as.character(first.pieces))
print(the.patterns)
 [1] "id"            "age"           "gender"        "income"       
 [5] "region"        "persona"       "Awareness"     "BP"           
 [9] "Consideration" "Consumption"   "Satisfaction"  "Advocacy"     

Nonetheless, Numerous Problems to Address

We will investigate each of these issues more thoroughly.

Issue #1: Decrypting the Names

Lots of Columns

Improving the Naming Conventions

Using numbers in place of the names of the products and the traits can create confusion and errors.

It is better to maintain an unambiguous link between the products, traits, and resulting data.

A better variable name: BP_14_11 becomes Mousse_Malt_Magic_Relaxing_0_10.

Creating a Products File

The products and their indices were listed in the annotated survey, but this information was nowhere to be found in the actual data set.

I created my own short file called products.csv.

products <- fread(input = "products.csv")
datatable(data = products, rownames = FALSE)

Creating a Traits File

I created a file called brand perception traits.csv.

bp.traits <- fread(input = "brand perception traits.csv")
datatable(data = bp.traits, rownames = FALSE)

Converting the Variables’ Names

change.bp.variable.name <- function(the.names, products, 
    bp.traits, prefix = "BP_") {
    require(plyr)
    require(data.table)
    
    short.names <- gsub(pattern = prefix, replacement = "", 
        x = the.names)
    new.names <- character(length(short.names))
    the.pieces <- as.data.table(t(as.data.table(strsplit(x = short.names, 
        split = "_", fixed = TRUE))))
    setnames(x = the.pieces, old = names(the.pieces), new = c("Trait", 
        "Product"))
    
    the.trait <- mapvalues(x = the.pieces[, Trait], from = bp.traits[, 
        Number], to = bp.traits[, Name], warn_missing = FALSE)
    
    the.product <- mapvalues(x = the.pieces[, Product], 
        from = products[, Number], to = products[, Name], 
        warn_missing = FALSE)
    
    new.names <- sprintf("%s_%s_0_10", the.product, the.trait)
    
    return(new.names)
}

Displaying the New Variable Names

bp.variables <- names(dat)[grep(pattern = "BP_", x = names(dat))]
new.bp.names <- change.bp.variable.name(the.names = bp.variables, 
    products = products, bp.traits = bp.traits)
print(new.bp.names[1:12])
 [1] "Brownie Pops_A brand for me._0_10"    
 [2] "Browniemint Bark_A brand for me._0_10"
 [3] "Caked On_A brand for me._0_10"        
 [4] "Chip Strips_A brand for me._0_10"     
 [5] "Chippy Cheese_A brand for me._0_10"   
 [6] "Choco Loco_A brand for me._0_10"      
 [7] "Cocoa Bears_A brand for me._0_10"     
 [8] "Cookie Crumble_A brand for me._0_10"  
 [9] "Fig Out!_A brand for me._0_10"        
[10] "Frostipops_A brand for me._0_10"      
[11] "Frozen Frogurt_A brand for me._0_10"  
[12] "Gummi Beans_A brand for me._0_10"     

Cleaning Up the New Variable Names

variable.name.cleanup <- function(x, removal.characters = c(",", 
    ".", "!")) {
    for (i in 1:length(x)) {
        x[i] <- gsub(pattern = " ", replacement = "_", x = x[i], 
            fixed = TRUE)
        for (j in 1:length(removal.characters)) {
            x[i] <- gsub(pattern = removal.characters[j], 
                replacement = "", x = x[i], fixed = TRUE)
        }
    }
    return(x)
}
new.bp.names <- variable.name.cleanup(x = new.bp.names)

The Cleaned Up Names

print(new.bp.names[1:12])
 [1] "Brownie_Pops_A_brand_for_me_0_10"    
 [2] "Browniemint_Bark_A_brand_for_me_0_10"
 [3] "Caked_On_A_brand_for_me_0_10"        
 [4] "Chip_Strips_A_brand_for_me_0_10"     
 [5] "Chippy_Cheese_A_brand_for_me_0_10"   
 [6] "Choco_Loco_A_brand_for_me_0_10"      
 [7] "Cocoa_Bears_A_brand_for_me_0_10"     
 [8] "Cookie_Crumble_A_brand_for_me_0_10"  
 [9] "Fig_Out_A_brand_for_me_0_10"         
[10] "Frostipops_A_brand_for_me_0_10"      
[11] "Frozen_Frogurt_A_brand_for_me_0_10"  
[12] "Gummi_Beans_A_brand_for_me_0_10"     

Changing the Names of the Variables in the data.table

setnames(x = dat, old = bp.variables, new = new.bp.names)
datatable(data = dat[1:5, .SD, .SDcols = grep(pattern = "Cookie_Crumble_", 
    x = names(dat))], rownames = FALSE)

Examining A Brand Perception Variable

tab <- dat[, .N, keyby = "Cookie_Crumble_Tastes_great_0_10"]
datatable(data = tab, rownames = FALSE)

Issue #2: Cleaning the Values of the Variables

The values of the Brand Perceptions of Cookie Crumble appear to be numeric. However, there are character values for:

We want to place these values on a numeric scale to make them more amenable to analyses:

Numeric Conversions

convert.bp.to.numeric <- function(x){
  x[x == "0: Strongly Disagree"] <- "0"
  x[x == "10: Strongly Agree"] <- "10"
  
  return(as.numeric(x))
}
dat <- dat[, (new.bp.names) := lapply(X = .SD, FUN = "convert.bp.to.numeric"), .SDcols = new.bp.names]

Checking the Conversion

new.tab <- dat[, .N, keyby = "Cookie_Crumble_Tastes_great_0_10"]
datatable(data = new.tab, rownames = FALSE)

Issue #3: The Data’s Peculiar Structure

The data covers many products and variables. There are separate columns for each pair of a product and a product-specific trait.

The Full Extent of the Problem.

The ultimate goal of the project was to create analyses and models for each product, category, and subgroup, and to display everything in a dynamic interface that would allow the marketing team to access information easily and interactively. But, with this kind of data, how could we pull out the information they needed in a reasonable way?

What To Do?

Option 1: Dynamic Extraction

Dynamic Extraction’s Mechanics

datatable(data = dat[1:5], rownames = FALSE)
  1. Select a brand.
  2. Map the brand to its corresponding number.
  3. Extract the appropriate columns of every brand-specific variable into a new data set.

Example of Dynamic Extraction

extract.brand.data <- function(dat, the.brand, products) {
    the.number <- products[Name == the.brand, Number]
    brand.dat <- dat[, .(id, age, Awareness = get(sprintf("Awareness_%d", 
        the.number)), Consideration = get(sprintf("Consideration_%d", 
        the.number)))]
    return(brand.dat)
}
cookie.crumble.dat <- extract.brand.data(dat = dat, the.brand = "Cookie Crumble", 
    products = products)
datatable(data = cookie.crumble.dat[1:5, ], rownames = FALSE)

Advantages of Dynamic Extraction

Disadvantages of Dynamic Extraction

There are a number of drawbacks to this approach:

For example, suppose you would like to aggregated all of the Premium products. This would involve dynamically extracting the data for each individual product in the Premium category and then binding all of these separate data sets into one larger one. That’s a lot of processing every time you want to start an analysis!

That’s Exactly What I Tried To Do

Option 2: Separate Data for Each Product

The Mechanics of Separating the Data

separate.product.data <- function(dat, products) {
    library(data.table)
    for (i in 1:products[, .N]) {
        product.dat <- extract.brand.data(data = dat, the.brand = products[i, 
            Name], products = products)
        fwrite(x = product.dat, file = sprintf("%s Data.csv", 
            products[i, Name]))
    }
}

Now we have the data for each individual product stored in separate CSV files.

Fundamentally, this approach requires pre-processing of the data. All of the clean-up, transformations, and extractions are done at an earlier stage. Then, for the modeling and reporting, only the data from the selected products would be loaded.

Advantages of Separate Data Sets

Disadvantages of Separate Data Sets

Creating separate files can have a number of drawbacks:

Option 3: Restructuring the Data

Mechanics of Restructuring

Starting with a data set that has n survey respondents and k products:

  1. Classify your variables as person-specific or brand-specific.

  2. Create a new data structure with \(n*k\) rows.

  3. Fill in the person-specific variables.

  4. Aggregate the brand-specific columns into a single variable.

Fundamentally, the restructured form would change the unit of observation from a single person’s responses for all of the products to that of a single person’s response to a single product.

Step 1: Classifying the Variables

Identify all of the variables that are person-specific:

Identify all of the variables that are brand-specific:

Step 2: Create a New Data Structure

Create a new data structure with \(n*k\) rows:

Before filling in the values of this new structure, it might look something like this:

Step 3: Fill in the Person-Specific Variables

Step 4: Aggregate the Brand-Specific Variables

Some Vocabulary for Restructuring Data Sets

Melting Data – the Laborious Way

As an example, we will create new columns for Awareness and Consideration that combine the multiple columns across the first two products.

n <- dat[, .N]
k <- products[1:2, .N]
mdat <- data.table(id = rep(x = dat[, id], times = k), age = rep(x = dat[, age], times = k))
mdat[, Product := ""]
mdat[, Awareness := numeric(n)]
mdat[, Consideration := numeric(n)]
for(i in 1:k){
  mdat[(i-1)*n + 1:n, Product := rep(products[i, Name])]
  mdat[(i-1)*n + 1:n, Awareness := dat[, get(sprintf("Awareness_%d", products[i, Number]))]]
  mdat[(i-1)*n + 1:n, Consideration := dat[, get(sprintf("Consideration_%d", products[i, Number]))]]
}

Melting Data – The Efficient Way

id.vars <- c("id", "age")
measure.vars <- list(Awareness = sprintf("Awareness_%d", 1:k), Consideration = sprintf("Consideration_%d", 1:k))
mdat <- melt(data = dat, id.vars = id.vars, measure.vars = measure.vars, variable.name = "Product", value.name = c("Awareness", "Consideration"))
mdat[, Product := mapvalues(x = Product, from = products[, Number], to = products[, Name], warn_missing = FALSE)]
datatable(data = mdat[1:100], rownames = FALSE)

Advantages of Melting a Wide Data Set

Disadvantages of Melting

Example: Product Awareness

What percentage of respondents are aware of Cookie Crumble?

mdat[Product == "Cookie Crumble", sprintf("%.2f%%", 100 * 
    mean(Awareness))]
[1] "75.68%"

This is considerably easier than mapping from a product name to its number, finding the corresponding column for Awareness, and then performing the computation. The melted structure of the data is more amenable to analysis.

Example: Product Consideration

What percentage of respondents have considered Cookie Crumble and Sweet Saltines in age groups of 18-34, 35-49, 50-65, and 65+?

library(Hmisc)
mdat[, `:=`(age_bracket, cut2(x = age, cuts = c(18, 35, 
    50, 65, 120)))]
tab = mdat[, .(Consideration = round(x = 100 * mean(Consideration, 
    na.rm = TRUE), digits = 2)), keyby = c("Product", "age_bracket")]
datatable(data = tab, rownames = FALSE)

Guidelines for Wide, Messy Data

For a data set with many products and separate columns for each product, melting the data can be advantageous. Here are some general recommendations:

Customer Engagement

Hidden Biases

If you are willing to accept the assumptions in your sampling methodology, then you can analyze the data to answer questions about your customers.

States of Engagement

A Progression

Awareness

“Have you ever heard of this product?”

Consideration

“Would you consider using this product?”

Consumption

This is a question of whether a respondent has used the product. Consumption is considered so important that the types of use are often categorized.

Satisfaction

This could also be a binary measure:

Loyalty

Advocacy

A measure of whether the respondent recommends the product to others – and how strongly.

I recommend the product:

Different combinations of selections could be used to measure:

Businesses Seek Out Advocates

Subgroup Analyses

Creating Personas for Customers

Clustering the Customers

Assigning Personas to Respondents

The Selected Personas for Snack Food Customers

Customer Engagement by Persona

The Story of Cookie Crumble

The following story is based on the real data that I saw – rather than what was simulated for the file used in this lecture.

Recent Consumption by Persona

Satisfaction by Persona

Advocacy by Persona

The Best Customers

Awareness by Persona

A Major Finding

Or So We Thought

Checking Other Products

How Can We Resolve This Inconsistency?

More to Come

With hundreds of products and thousands of variables, we have only begun to dig into the kinds of insights we can produce by examining surveyed data.